Movie Collection Analysis

Load packages

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyodbc

from IPython.display import Image

%matplotlib inline

Load files

In [4]:
pwd
Out[4]:
'C:\\Users\\Min\\Jupyter\\movie_record'
In [5]:
# load csv files as dataframes
df_writer = pd.read_csv("original_data\\Movie_Writer.csv")
df_actor = pd.read_csv("original_data\\Movie_Actors.csv")
df_rating = pd.read_csv("original_data\\Movie_AdditionalRating.csv")
df_movie = pd.read_csv("original_data\\Movie_Movies.csv")
df_genre = pd.read_csv("original_data\\Movie_Genres.csv")
C:\Users\Min\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3058: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Data Exploration and Cleaning for importing data into SQL

First I do a very simple exploratory analysis to understand how all the data is structued and related and I also clean the data in order to put the data into SQL

Dimensionality and raw data check

In [6]:
for i in [df_writer, df_actor, df_rating, df_movie, df_genre]:
    print(i.shape)
(66164, 4)
(143869, 3)
(92016, 4)
(178687, 18)
(308565, 3)

Movie_Writer.csv Data

In [7]:
df_writer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66164 entries, 0 to 66163
Data columns (total 4 columns):
Unnamed: 0        66164 non-null int64
Person            66164 non-null object
Responsibility    66164 non-null object
imdbID            66164 non-null object
dtypes: int64(1), object(3)
memory usage: 2.0+ MB
In [8]:
df_writer.head()
Out[8]:
Unnamed: 0 Person Responsibility imdbID
0 0 Ernest Vajda story tt0016750
1 1 Ben F. Wilson scenario tt0328410
2 2 Sam-yuk Yoon screenplay tt0366512
3 3 Judy Rothman Rofé screenplay tt0206367
4 4 E.B. White book tt0206367
In [9]:
df_writer['Responsibility'].value_counts()
Out[9]:
screenplay                 17454
story                      17106
novel                       3814
scenario                    1983
play                        1863
                           ...  
based upon a book by           1
inspired from a story          1
dialogue & scenario            1
Peter Pan created by           1
story "The Rival Dummy"        1
Name: Responsibility, Length: 2424, dtype: int64
In [10]:
df_writer[df_writer['imdbID'].duplicated()]
Out[10]:
Unnamed: 0 Person Responsibility imdbID
4 4 E.B. White book tt0206367
6 6 Steve Fisher screenplay tt0046287
8 8 Billy Wilder story tt0025597
12 12 Tama Janowitz screenplay tt0098347
15 15 Leonard Neubauer screenplay tt0069796
... ... ... ... ...
66157 66157 John Logan screenplay tt0146838
66158 66158 Oliver Stone screenplay tt0146838
66160 66160 David Howard screenplay tt0177789
66161 66161 Robert Gordon screenplay tt0177789
66163 66163 Anthony Minghella screenplay tt0134119

32821 rows × 4 columns

Movie_Actors

In [11]:
df_actor.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143869 entries, 0 to 143868
Data columns (total 3 columns):
Unnamed: 0    143869 non-null int64
Actors        143869 non-null object
imdbID        143869 non-null object
dtypes: int64(1), object(2)
memory usage: 3.3+ MB
In [12]:
df_actor.head()
Out[12]:
Unnamed: 0 Actors imdbID
0 0 Cass Barbera tt2268369
1 2 Charles A. Post tt0016750
2 3 Samuel Green tt3405286
3 4 Surin Mendis tt3816698
4 7 Baby Rube tt0328410
In [13]:
df_actor[df_actor['imdbID'].duplicated()]
Out[13]:
Unnamed: 0 Actors imdbID

Movie_AdditionalRating

In [14]:
df_rating.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92016 entries, 0 to 92015
Data columns (total 4 columns):
Unnamed: 0      92016 non-null int64
Rating          92016 non-null object
RatingSource    92016 non-null object
imdbID          92016 non-null object
dtypes: int64(1), object(3)
memory usage: 2.8+ MB
In [15]:
df_rating.head()
Out[15]:
Unnamed: 0 Rating RatingSource imdbID
0 0 6.9/10 Internet Movie Database tt1006835
1 1 2.4/10 Internet Movie Database tt2554714
2 2 6.3/10 Internet Movie Database tt1090670
3 3 5.5/10 Internet Movie Database tt1990255
4 4 5.0/10 Internet Movie Database tt0206367
In [16]:
df_rating[df_rating['imdbID'].duplicated()]
Out[16]:
Unnamed: 0 Rating RatingSource imdbID
5 5 15% Rotten Tomatoes tt0206367
6 6 27/100 Metacritic tt0206367
14 14 13% Rotten Tomatoes tt0098347
34 34 40% Rotten Tomatoes tt2140619
35 35 45/100 Metacritic tt2140619
... ... ... ... ...
92011 15447 52% Rotten Tomatoes tt0145653
92012 15448 54/100 Metacritic tt0145653
92013 15449 7.6/10 Internet Movie Database tt0174856
92014 15450 83% Rotten Tomatoes tt0174856
92015 15451 74/100 Metacritic tt0174856

29622 rows × 4 columns

In [17]:
df_rating[df_rating['imdbID'].duplicated()]
Out[17]:
Unnamed: 0 Rating RatingSource imdbID
5 5 15% Rotten Tomatoes tt0206367
6 6 27/100 Metacritic tt0206367
14 14 13% Rotten Tomatoes tt0098347
34 34 40% Rotten Tomatoes tt2140619
35 35 45/100 Metacritic tt2140619
... ... ... ... ...
92011 15447 52% Rotten Tomatoes tt0145653
92012 15448 54/100 Metacritic tt0145653
92013 15449 7.6/10 Internet Movie Database tt0174856
92014 15450 83% Rotten Tomatoes tt0174856
92015 15451 74/100 Metacritic tt0174856

29622 rows × 4 columns

Movie_Movies

In [18]:
df_movie.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178687 entries, 0 to 178686
Data columns (total 18 columns):
Awards        20340 non-null object
Country       165998 non-null object
DVD           15949 non-null object
Director      151353 non-null object
Language      153196 non-null object
Plot          97476 non-null object
Poster        48414 non-null object
Production    16446 non-null object
Rated         23906 non-null object
Released      116974 non-null object
Runtime       119497 non-null object
Title         178686 non-null object
Type          178686 non-null object
Website       6747 non-null object
Year          178686 non-null object
imdbID        178686 non-null object
imdbRating    62073 non-null float64
imdbVotes     62029 non-null object
dtypes: float64(1), object(17)
memory usage: 24.5+ MB
In [19]:
df_movie.head(3)
Out[19]:
Awards Country DVD Director Language Plot Poster Production Rated Released Runtime Title Type Website Year imdbID imdbRating imdbVotes
0 NaN USA NaN Rose Cummings English Rachel constantly hears her baby cry from the ... NaN NaN NaN 26 Apr 2012 20 min Baby's Breath movie NaN 2012 tt2268369 NaN NaN
1 NaN USA NaN James Byrne NaN The struggle against unfortunate circumstances... NaN NaN NaN NaN 9 min Winter Trees movie NaN 2008 tt1560760 NaN NaN
2 NaN USA NaN Dimitri Buchowetzki NaN NaN NaN NaN NaN 27 Mar 1926 50 min The Crown of Lies movie NaN 1926 tt0016750 NaN NaN
In [20]:
df_movie.tail(3)
Out[20]:
Awards Country DVD Director Language Plot Poster Production Rated Released Runtime Title Type Website Year imdbID imdbRating imdbVotes
178684 Won 1 Golden Globe. Another 4 wins & 23 nomina... UK, Germany, Japan, USA 30 May 2000 Milos Forman English The life and career of a legendary comedian, A... https://ia.media-imdb.com/images/M/MV5BNDI1Mjc... Universal R 22 Dec 1999 118 min Man on the Moon movie http://www.universalpictures.com/manonthemoon 1999 tt0125664 7.4 110,566
178685 7 wins & 14 nominations. USA 02 May 2000 Dean Parisot English The alumni cast of a space opera television se... https://ia.media-imdb.com/images/M/MV5BNmZlNTY... DreamWorks SKG PG 25 Dec 1999 102 min Galaxy Quest movie http://www.amazon.com/exec/obidos/subst/video/... 1999 tt0177789 7.3 132,448
178686 Nominated for 5 Oscars. Another 10 wins & 75 n... USA 27 Jun 2000 Anthony Minghella English, Italian In late 1950s New York, Tom Ripley, a young un... https://images-na.ssl-images-amazon.com/images... Paramount Pictures R 25 Dec 1999 139 min The Talented Mr. Ripley movie http://www.talentedmrripley.com 1999 tt0134119 7.4 151,938
In [21]:
df_movie['Type'].value_counts()
Out[21]:
movie     178665
series        21
Name: Type, dtype: int64
In [22]:
df_movie['DVD'].value_counts()
Out[22]:
29 Mar 2005    82
31 May 2005    55
01 Jan 1998    50
01 Jun 2004    49
08 Mar 2005    49
               ..
28 Sep 1988     1
31 Jan 1995     1
05 Jun 2009     1
09 Jan 1996     1
20 Aug 1987     1
Name: DVD, Length: 2586, dtype: int64
In [23]:
df_movie[df_movie['imdbID'].duplicated()]
Out[23]:
Awards Country DVD Director Language Plot Poster Production Rated Released Runtime Title Type Website Year imdbID imdbRating imdbVotes

Movie_Genres

In [24]:
df_genre.head()
Out[24]:
Unnamed: 0 Genre imdbID
0 0 Short tt2268369
1 1 Drama tt2268369
2 2 Short tt1560760
3 3 Drama tt1560760
4 4 Drama tt0016750
In [25]:
df_genre.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308565 entries, 0 to 308564
Data columns (total 3 columns):
Unnamed: 0    308565 non-null int64
Genre         308565 non-null object
imdbID        308565 non-null object
dtypes: int64(1), object(2)
memory usage: 7.1+ MB
In [26]:
df_genre['Genre'].value_counts()
Out[26]:
Short           56867
Documentary     34851
 Drama          29726
 Short          22473
 Comedy         20049
Drama           17228
Comedy          15063
Adult           12135
 Romance         7653
Animation        6926
Action           5933
 Music           5889
 Thriller        5816
 Horror          5312
 Family          5082
 Adventure       4048
 Crime           3979
 Fantasy         3914
 Biography       3602
 Action          3158
 Mystery         3157
 History         2835
 Sci-Fi          2829
Crime            2805
Horror           2800
Music            2609
Adventure        2291
 Sport           1764
 War             1533
Family           1438
 Musical         1408
 Western         1400
Thriller         1264
Biography        1229
Sport            1187
 News             873
Romance           803
Western           766
Reality-TV        687
 Animation        609
Musical           605
Mystery           598
Sci-Fi            494
 Documentary      489
Fantasy           486
 Adult            420
News              267
History           252
Talk-Show         240
 Reality-TV       193
 Film-Noir        149
 Talk-Show        144
War               121
Game-Show          59
 Game-Show         51
Film-Noir           6
Name: Genre, dtype: int64
In [27]:
df_genre[df_genre['imdbID'].duplicated()]
Out[27]:
Unnamed: 0 Genre imdbID
1 1 Drama tt2268369
3 3 Drama tt1560760
5 5 Romance tt0016750
7 7 Short tt3405286
9 9 Drama tt3816698
... ... ... ...
308558 317235 Drama tt0125664
308560 317237 Comedy tt0177789
308561 317238 Sci-Fi tt0177789
308563 317240 Drama tt0134119
308564 317241 Thriller tt0134119

138555 rows × 3 columns

Data Cleaing

Remove "Unnamed: 0" Column from df_writer, df_actor, df_rating, df_genre
In [28]:
df_writer.drop("Unnamed: 0", axis = 1, inplace = True)
df_actor.drop("Unnamed: 0", axis = 1, inplace = True)
df_rating.drop("Unnamed: 0", axis = 1, inplace = True)
df_genre.drop("Unnamed: 0", axis = 1, inplace = True)
Save the cleaned files as csv file again.
In [29]:
df_writer.to_csv("cleaned_data\\movie_writer_2.csv", index=False)
df_actor.to_csv("cleaned_data\\movie_actor_2.csv", index=False)
df_rating.to_csv("cleaned_data\\movie_rating_2.csv", index=False)
df_genre.to_csv("cleaned_data\\movie_genre_2.csv", index=False)

All dataframes have imdbID. So it can be used as primary key

The main table will be df_movie. In order for imdbID to be Primary Key, os there shouldn't be any duplicates in rows and NaN or NA in imdbID column.

Does imdbID have duplicates or any Na/NaN rows?

In [30]:
df_movie[df_movie.duplicated()]
Out[30]:
Awards Country DVD Director Language Plot Poster Production Rated Released Runtime Title Type Website Year imdbID imdbRating imdbVotes

No duplicates.

In [31]:
df_movie[df_movie['imdbID'].isnull()]
Out[31]:
Awards Country DVD Director Language Plot Poster Production Rated Released Runtime Title Type Website Year imdbID imdbRating imdbVotes
123293 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

One dummy row.

Remove the dummy row.
In [32]:
df_movie = df_movie[df_movie['imdbID'].notnull()]
Save the cleaned file as csv file again.
In [33]:
df_movie.to_csv("cleaned_data\\movie_movies_2.csv", index=False)

There are five data frames in total. The column every data frame has is imdbID. The main table for this data should be Movie_Movies.csv because it has one-to-one match between Title and imdbID. I am going to put the data into SQL now.

In [34]:
Image(filename='movie_record.png') 
Out[34]:

Import all the data into MS-SQL Server

Diagram of the database

In [35]:
Image(filename='tables.png') 
Out[35]:
In [36]:
Image(filename='diagram.png') 
Out[36]:

Exploratory Data Analysis (EDA)

Quation1: How many movies the Top10 directors in the data set produced?

Load data from SQL database

In [37]:
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=DESKTOP-MB0G47L;'
                      'Database=test;'
                      'Trusted_Connection=yes;')
In [38]:
query_first = "SELECT Director, Title, imdbID FROM dbo.movie_movie_2 WHERE Type = 'Movie'"  # only moive not series
In [39]:
df_first = pd.read_sql(query_first, conn)

Is there any None value in Director and Title columns?

In [40]:
df_first[df_first['Title'].isnull()]
Out[40]:
Director Title imdbID
In [41]:
df_first[df_first['Director'].isnull()]
Out[41]:
Director Title imdbID
16 None Niagara Falls tt0000162
24 None The Artist and the Flower Girl tt0000220
25 None Boat Race tt0000226
29 None Buffalo Bill's Wild West Parade tt0000279
30 None Buffalo Bill's Wild West Parade tt0000280
... ... ... ...
178645 None He Lied About Everything tt8013990
178648 None Warren Jeffs: Prophet of Evil tt8019406
178652 None The Force of Sound tt8037350
178658 None Girls Incarcerated tt8092942
178661 None Gad Elmaleh: American Dream tt8110630

27312 rows × 3 columns

Remove the None rows

In [42]:
df_first = df_first[df_first['Director'].notnull()]

Visual check

In [43]:
Image(filename='SQL1.png')
Out[43]:

We can see some rows with more than one names.

Explode , separated directors

In [44]:
df_first['Director'] = df_first['Director'].str.split(",") # make it as a list - comma separated
In [45]:
df_first = df_first.explode('Director') # explode the lists

Have new rows created?

In [46]:
df_first[df_first['Title'].duplicated()]
Out[46]:
Director Title imdbID
0 Louis Lumière The Arrival of a Train tt0000012
41 Charles Urban Reproduction, Coronation Ceremonies: King Edwa... tt0000387
46 Edwin S. Porter Life of an American Fireman tt0000447
64 G.W. Bitzer The Adventures of Dollie tt0000628
166 Theodore Wharton The Girl from Arizona tt0001232
... ... ... ...
178649 Greg Watkins Dog Story tt8020796
178649 Caveh Zahedi Dog Story tt8020796
178663 Frédéric Schuld Carlotta's Face tt8145470
178664 Charles Steven Duran(co-director) Easy Money tt8200864
178664 La La King(co-director) Easy Money tt8200864

34801 rows × 3 columns

Remove (co-director) from elements of the Director column

In [47]:
df_first['Director'] = df_first['Director'].str.replace(r"\(.*\)","")

Does space and capital/lower affect analysis?

Current result

In [48]:
df_first['Director'].value_counts().head(10)
Out[48]:
Jim Powers                             156
Dave Fleischer                         109
D.W. Griffith                          103
Lewin Fitzhamon                         95
Al Christie                             90
Georges Méliès                          87
Gilbert M. 'Broncho Billy' Anderson     85
Kevin Dunn                              85
Quasarman                               84
James H. White                          78
Name: Director, dtype: int64
Capital or lower case?
In [49]:
df_first['Director'].str.lower().value_counts().head(10)
Out[49]:
jim powers                             156
dave fleischer                         109
d.w. griffith                          103
lewin fitzhamon                         95
al christie                             90
georges méliès                          87
kevin dunn                              85
gilbert m. 'broncho billy' anderson     85
quasarman                               84
james h. white                          78
Name: Director, dtype: int64
Any white space?
In [50]:
df_first['Director'].str.strip().value_counts().head(10)
Out[50]:
Jim Powers                             157
Dave Fleischer                         109
D.W. Griffith                          104
Lewin Fitzhamon                         95
Al Christie                             90
Georges Méliès                          88
Kevin Dunn                              86
Gilbert M. 'Broncho Billy' Anderson     85
Quasarman                               85
James H. White                          83
Name: Director, dtype: int64

White space affect the reult. Remove whitespace

In [51]:
df_first['Director'] = df_first['Director'].str.strip()

Are there duplicates in Director and Title columns?

In [52]:
df_first[df_first.duplicated()]
Out[52]:
Director Title imdbID
47184 Jonathan Richardson Stone Tears tt0399699
53386 Charles Portney The Shop Below the Busy Road tt0454685
68940 Kerry Weber Vegan Cooking for Animal Lovers tt1116856
73534 Olympia Stone The Collector: Allan Stone's Life in Art tt1262021
79561 Greg Lavin Little Sister tt1414838
... ... ... ...
177370 James Button The Corpse Series tt6935504
177375 Victoria Sutton Cyborgs. Should We Be Better Than We Are? tt6942256
177768 Nick Alexander Shed Tears for the River tt7162110
177843 James Redford Happening: A Clean Energy Revolution tt7212266
177911 Kevin Merz Gotthard: One Life, One Soul tt7245696

116 rows × 3 columns

Drop duplicates

In [53]:
df_first.drop_duplicates(keep='first', inplace =True)

Doesn't affect the result.

Top 10 directors by movies produced

In [54]:
names = list(df_first.groupby('Director')['Title'].count().sort_values(ascending = False).head(10).index)
movies = list(df_first.groupby('Director')['Title'].count().sort_values(ascending = False).head(10).values)

plt.figure(figsize=(14,6));
df_first.groupby('Director')['Title'].count().sort_values(ascending = False).head(10)[::-1].plot.barh(color=(0.9, 0.9, 0.9, 0.9), edgecolor='blue');
plt.title("Top10 directors by moives produced", fontsize=20);
plt.xlabel('Number of Movies', fontsize=14);
plt.ylabel('Directors', fontsize=14);

for count, i in enumerate(movies[::-1]):
    plt.text(i, count, str(i),  ha='left', va='center');

    plt.grid(True);
    
print(df_first.groupby('Director')['Title'].count().sort_values(ascending = False).head(10))
Director
Jim Powers                             157
Dave Fleischer                         109
D.W. Griffith                          104
Lewin Fitzhamon                         95
Al Christie                             90
Georges Méliès                          88
Kevin Dunn                              86
Quasarman                               85
Gilbert M. 'Broncho Billy' Anderson     85
James H. White                          83
Name: Title, dtype: int64

The director who produced the most movies is Jim Powers with 157 moives. He is incomparable to any one on the Top 10 list. The second and third ranked directors are Dave Fleischer and D.W. Griffth with 109 and 104, respectively. The numbers of the 2nd and 3rd show a huge difference to the Top 1 director Jim Powers. So when it comes to making movies, Jim Powers has an extraordinary achievement. In addition, the directors from Top4 to Top10 have produced quite a similar number of moives, ranging from 83 to 95, showing a meaning gap between the Top3 group and the rest of Top 10 directors.

Quastion2: What kind of genre the directors in Top 10 list from Q1 make?

In [55]:
query_second = "SELECT dbo.movie_movie_2.imdbID, Director, Genre \
FROM dbo.movie_movie_2 \
JOIN dbo.movie_genre_2 \
ON movie_movie_2.imdbID = dbo.movie_genre_2.imdbID \
WHERE Director IS NOT NULL" 
In [56]:
df_second = pd.read_sql(query_second, conn)
In [57]:
df_second.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 269324 entries, 0 to 269323
Data columns (total 3 columns):
imdbID      269324 non-null object
Director    269324 non-null object
Genre       269324 non-null object
dtypes: object(3)
memory usage: 6.2+ MB
In [58]:
df_second.head()
Out[58]:
imdbID Director Genre
0 tt2268369 Rose Cummings Short
1 tt2268369 Rose Cummings Drama
2 tt1560760 James Byrne Short
3 tt1560760 James Byrne Drama
4 tt0016750 Dimitri Buchowetzki Drama

Use the codes above in Q1 for cleaning the Director column

In [59]:
def name_cleaner(df):  
    df['Director'] = df['Director'].str.split(",") # make it as a list - comma separated
    df = df.explode('Director') # explode the lists
    df['Director'] = df['Director'].str.replace(r"\(.*\)","")
    df['Director'] = df['Director'].str.strip()
    df.drop_duplicates(keep='first', inplace =True)
    return df
In [60]:
df_second_new = name_cleaner(df_second)
Pivot table for counts movies genre by director
In [61]:
df_second_new = df_second_new.pivot_table(index='Director', columns='Genre',  values='imdbID', aggfunc= lambda x: len(x))
In [62]:
df_second_new.reset_index(inplace=True)
In [63]:
df_second_new.head(3)
Out[63]:
Genre Director Action Adult Adventure Animation Biography Comedy Crime Documentary Drama ... News Reality-TV Romance Sci-Fi Short Sport Talk-Show Thriller War Western
0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN ... 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 'Big' James Wroblewski NaN NaN NaN NaN NaN NaN NaN NaN 1.0 ... NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
2 'Chico' Hernandez NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 29 columns

Get the top 10 list from Q1
In [64]:
df_top_ten = df_first.groupby('Director', as_index=False)['Title'].count().sort_values(ascending=False, by='Title').head(10)
In [65]:
df_top_ten.reset_index(drop=True, inplace=True)
In [66]:
df_top_ten.columns = ['Director', 'Counts']

Merge two data frame df_top_ten and df_second_new - Inner Join

In [67]:
df_top_ten
Out[67]:
Director Counts
0 Jim Powers 157
1 Dave Fleischer 109
2 D.W. Griffith 104
3 Lewin Fitzhamon 95
4 Al Christie 90
5 Georges Méliès 88
6 Kevin Dunn 86
7 Quasarman 85
8 Gilbert M. 'Broncho Billy' Anderson 85
9 James H. White 83
Merge two dataframe
In [68]:
df_merged = df_top_ten.merge(df_second_new, left_on='Director', right_on = 'Director')
In [69]:
df_merged.head()
Out[69]:
Director Counts Action Adult Adventure Animation Biography Comedy Crime Documentary ... News Reality-TV Romance Sci-Fi Short Sport Talk-Show Thriller War Western
0 Jim Powers 157 1.0 152.0 NaN NaN NaN 3.0 1.0 1.0 ... NaN NaN 4.0 NaN 3.0 NaN NaN NaN NaN NaN
1 Dave Fleischer 109 NaN NaN NaN 109.0 NaN 57.0 NaN NaN ... NaN NaN NaN NaN 102.0 NaN NaN NaN NaN NaN
2 D.W. Griffith 104 7.0 NaN NaN NaN 1.0 19.0 4.0 1.0 ... NaN NaN 13.0 NaN 94.0 NaN NaN NaN 4.0 8.0
3 Lewin Fitzhamon 95 1.0 NaN 2.0 NaN NaN 54.0 13.0 NaN ... NaN NaN 4.0 NaN 94.0 NaN NaN NaN 2.0 1.0
4 Al Christie 90 NaN NaN 4.0 NaN NaN 79.0 NaN NaN ... NaN NaN 3.0 NaN 89.0 NaN NaN NaN NaN 7.0

5 rows × 30 columns

In [70]:
fig, ([ax1, ax2], [ax3, ax4]) = plt.subplots(2, 2, figsize=(16,16))

ax1.pie(df_merged.iloc[0][2:][df_merged.iloc[0][2:].notnull()], labels=df_merged.iloc[0][2:][df_merged.iloc[0][2:].notnull()].index)
ax1.set_title("Jim Powers Top1")
ax2.pie(df_merged.iloc[1][2:][df_merged.iloc[1][2:].notnull()], labels=df_merged.iloc[1][2:][df_merged.iloc[1][2:].notnull()].index)
ax2.set_title("Dave Fleischer Top2")
ax3.pie(df_merged.iloc[2][2:][df_merged.iloc[2][2:].notnull()], labels=df_merged.iloc[2][2:][df_merged.iloc[2][2:].notnull()].index)
ax3.set_title("D.W. Griffith Top3")
ax4.pie(df_merged.iloc[3][2:][df_merged.iloc[3][2:].notnull()], labels=df_merged.iloc[3][2:][df_merged.iloc[3][2:].notnull()].index)
ax4.set_title("Lewin Fitzhamon Top4")
plt.show()

I will take a look at only first four directos from the top. Interestingly, it seems that the Top1 director Jim Powers is an adult film director. Most of his films are in the adult gerne. In contrast, as for the Top 2 director, he's a film director for children as his films are mostly related to Animation and Comedy. For the rest of the directors (D.W. Griffith and Lewin Fitzhamon), it is not so obvious what their films are about. I assume that their work is relatively diverse than first two directos.

Quastion3: Can we see a hype of specific genre over the past years?

In [71]:
query_third = "SELECT DISTINCT movie_movie_2.imdbID, [YEAR], Genre \
FROM dbo.movie_movie_2 \
FULL JOIN dbo.movie_genre_2 \
ON dbo.movie_movie_2.imdbID = movie_genre_2.imdbID \
WHERE Genre IS NOT NULL" 
In [72]:
df_third = pd.read_sql(query_third, conn)
In [73]:
df_third.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308565 entries, 0 to 308564
Data columns (total 3 columns):
imdbID    308565 non-null object
YEAR      308565 non-null object
Genre     308565 non-null object
dtypes: object(3)
memory usage: 7.1+ MB
In [74]:
df_third.head(3)
Out[74]:
imdbID YEAR Genre
0 tt0000090 1896 Documentary
1 tt0000099 1896.0 Short
2 tt0000179 1898 Short
In [75]:
df_third['Genre'].value_counts()
Out[75]:
Short          79340
Drama          46954
Documentary    35340
Comedy         35112
Adult          12555
Action          9091
Music           8498
Romance         8456
Horror          8112
Animation       7535
Thriller        7080
Crime           6784
Family          6520
Adventure       6339
Biography       4831
Fantasy         4400
Mystery         3755
Sci-Fi          3323
History         3087
Sport           2951
Western         2166
Musical         2013
War             1654
News            1140
Reality-TV       880
Talk-Show        384
Film-Noir        155
Game-Show        110
Name: Genre, dtype: int64
In [76]:
df_third['YEAR'].value_counts().tail(50)
Out[76]:
1963.0       105
1946.0       104
1950.0       104
1931.0       103
1902.0       102
1926.0       102
1932.0       102
1934.0       102
1904.0       102
1943.0       101
1906.0       101
1905.0       101
1942.0        99
1940.0        99
1898.0        99
1937.0        98
1948.0        98
1962.0        95
1941.0        94
1923.0        93
1958.0        92
1947.0        91
1960.0        88
1930.0        88
1959.0        86
1951.0        86
1954.0        80
1952.0        75
1899.0        73
1945.0        71
2020          44
1897.0        43
1896.0        34
1895          31
2019.0        14
1894          14
2021          13
2021.0         7
1895.0         5
1894.0         4
2023           4
2020.0         4
2007–2011      3
1890           3
2003–2005      3
2010–2013      3
2012–2016      3
1889           3
2022           2
1891.0         2
Name: YEAR, dtype: int64
Remove unreasonable data 2021, 2021.0, 2022, 2010-2013, 2007-2011, 2003-2005, 2012-2016, 2003-2005
In [77]:
df_third = df_third[(df_third['YEAR'] != '2021') & (df_third['YEAR'] != '2021.0') & (df_third['YEAR'] != '2022') & (df_third['YEAR'] != '2010–2013') & (df_third['YEAR'] != '2007–2011') & (df_third['YEAR'] != '2003–2005') & (df_third['YEAR'] != '2012–2016') & (df_third['YEAR'] != '2023')]
Change data type of YEAR into int
In [78]:
df_third['YEAR'] = df_third['YEAR'].astype(float).astype(int)

Since the data is upto 2017, we will not consider the data after 2017

In [79]:
df_third = df_third[df_third['YEAR'] <= 2017]

Quick overview of the total number of moives by year

In [80]:
plt.figure(figsize=(17,8));
df_third.groupby(['YEAR'])['imdbID'].count().plot.bar();
plt.ylabel('Counts', fontsize=14);
plt.xlabel('Year', fontsize=14);
plt.title('Movie by year', fontsize=18);

Change the dataframe into plotable data using pivot table.

In [81]:
df_third_new = df_third.pivot_table(index='YEAR', columns='Genre',  values='imdbID', aggfunc= lambda x: len(x))

Simple information for the created dataframe.

In [82]:
df_third_new.head(3)
Out[82]:
Genre Action Adult Adventure Animation Biography Comedy Crime Documentary Drama Family ... News Reality-TV Romance Sci-Fi Short Sport Talk-Show Thriller War Western
YEAR
1889 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN
1890 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN
1891 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN

3 rows × 28 columns

In [83]:
df_third_new.describe()
Out[83]:
Genre Action Adult Adventure Animation Biography Comedy Crime Documentary Drama Family ... News Reality-TV Romance Sci-Fi Short Sport Talk-Show Thriller War Western
count 114.000000 63.000000 114.000000 111.000000 104.000000 122.000000 118.000000 127.000000 122.000000 109.000000 ... 68.000000 49.000000 113.000000 90.000000 127.000000 123.000000 40.000000 106.000000 118.000000 113.000000
mean 78.359649 199.285714 54.798246 67.468468 46.269231 286.278689 57.050847 277.464567 382.106557 59.422018 ... 16.750000 17.897959 74.362832 36.300000 623.724409 23.967480 9.550000 65.575472 13.949153 19.079646
std 139.877008 227.403324 92.333484 96.764928 118.994267 481.322408 83.902553 519.077429 700.801853 109.276687 ... 35.568884 24.409907 111.004590 64.539414 1307.753840 39.239478 13.217995 127.181906 15.025248 18.850512
min 1.000000 1.000000 1.000000 1.000000 1.000000 5.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 9.000000 12.000000 15.000000 24.500000 2.000000 71.500000 16.250000 38.500000 90.000000 11.000000 ... 1.000000 2.000000 25.000000 3.250000 105.000000 4.500000 1.000000 5.250000 5.000000 6.000000
50% 21.000000 114.000000 21.500000 31.000000 5.500000 114.500000 28.000000 86.000000 140.500000 19.000000 ... 2.000000 5.000000 31.000000 10.500000 153.000000 9.000000 3.500000 15.000000 8.500000 15.000000
75% 80.000000 315.000000 38.500000 53.500000 22.750000 221.250000 44.000000 170.000000 279.500000 46.000000 ... 10.250000 22.000000 57.000000 32.000000 343.000000 19.500000 12.500000 65.500000 15.750000 24.000000
max 655.000000 774.000000 461.000000 459.000000 616.000000 2310.000000 421.000000 2372.000000 3657.000000 552.000000 ... 149.000000 73.000000 520.000000 314.000000 6757.000000 172.000000 50.000000 640.000000 66.000000 106.000000

8 rows × 28 columns

This time I am going to use plotly for a interative diagram.

In [84]:
import dash
import plotly.graph_objects as go
import plotly.offline as py
Make a list for plot data
In [85]:
data_bar_list = list(df_third_new)
Make a list of plot data using list comprehension
In [86]:
data_bar = [go.Bar(
                    x=df_third_new.index,
                    y=df_third_new[genre].values,
                    name=genre,
                 ) for genre in data_bar_list]

Layout for the plot

In [87]:
layout_bar = go.Layout(
                    title='Movie Genres by Year',
                    barmode='stack',
                    yaxis_title='Number of movies',
                    xaxis_title='Year',
                    legend=dict(
                                x=-0.1,
                                y=-0.1,
                                orientation="h")
                        )

fig_bar = go.Figure(data=data_bar, layout=layout_bar)
In [88]:
fig_bar.update_layout(
                autosize=False,
                width=1000,
                height=700
                )
In [90]:
fig_bar.write_html("file.html")

The number of films produced per year has been increasing since 1920. One interesting fact from the plot above is that, for some reason, there was a short film boom between 1905 and 1919. As can be seen, there was a peak around 1910, then it drastically decreased. It seems that there might be an negative event that affected the film industry. So the early development of the film industry was ended due to the event. Since then the movie industry had a long plateau until 1980s. However, all of a sudden in 1990s the industry started booming and the number of films has been exponentially increasing until 2015. The most noticable genres in trend are drama, comedy and documentary (except for short), while one of the decreasing gernes is adult.